OpenVINO integration for CausalLM models #17
base: main
Conversation
Force-pushed from 0692b1c to 76a44fa
model_path: str,
model_class: Union[AutoModelForCausalLM, AutoModelForSeq2SeqLM],
dtype: torch.dtype,
quantize: Optional[str],  # not used by OpenVINO
Why not consider the quantize parameter as a trigger to compress the model weights to INT8 or INT4?
quantize is currently used for bitsandbytes and GPTQ, and passing anything else throws an error. We could presumably change that, but for weight compression it seemed that load_in_8bit (which is now the default) and, soon, load_in_4bit would be a better fit.
Personally, I would always compress offline and load the compressed model directly. TGIS requires downloaded weights, so if you want to compress the model on the fly, you would have to download the full-precision weights, keep them on disk, and then compress the model to 4 or 8 bit within TGIS every time, which takes several minutes.
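For illustration, here is a rough sketch of that offline workflow with optimum-intel; the model ID, output directory, and the exact load_in_8bit behaviour are assumptions, not part of this PR:

```python
# Sketch of offline weight compression with optimum-intel (names and paths
# are illustrative; check the optimum-intel docs for the exact API/version).
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"   # assumed example model
save_dir = "llama-2-7b-ov-int8"         # assumed output directory

# Convert the PyTorch checkpoint to OpenVINO IR and compress weights to INT8
# once, offline, so the serving process can load the compressed IR directly.
model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=True)
model.save_pretrained(save_dir)
AutoTokenizer.from_pretrained(model_id).save_pretrained(save_dir)

# At serving time the compressed model loads without re-exporting:
# OVModelForCausalLM.from_pretrained(save_dir)
```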
Thanks @helena-intel. I agree that offline compression is better. But I noticed you do allow on-the-fly conversion here, based on the logic below where you add the flag kwargs["export"] = True when the model is not model_is_ov. That is why I asked about on-the-fly compression for such models as well.
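For context, a minimal sketch of the kind of logic being discussed (the helper names and the openvino_model.xml check are assumptions; the actual PR code may differ):

```python
# Assumed sketch: detect whether the downloaded weights already contain an
# OpenVINO IR; if not, let optimum-intel convert the PyTorch weights on the
# fly via export=True.
import os
from optimum.intel import OVModelForCausalLM

def model_is_ov(model_path: str) -> bool:
    # An OpenVINO model directory contains an IR file such as openvino_model.xml.
    return os.path.exists(os.path.join(model_path, "openvino_model.xml"))

def load_ov_model(model_path: str):
    kwargs = {}
    if not model_is_ov(model_path):
        # Trigger on-the-fly conversion from the PyTorch checkpoint.
        kwargs["export"] = True
    return OVModelForCausalLM.from_pretrained(model_path, **kwargs)
```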
It's a good point! Note that at the moment we already do on-the-fly compression for models with more than 1B parameters, because they are converted to 8-bit by optimum-intel. But it would be good to make this configurable, especially now that we'll have load_in_4bit in optimum-intel soon. How do you propose to include this? Add a "weight_compression" option for quantize in addition to bitsandbytes and gptq? Or weight_compression_int4 and weight_compression_int8? Currently setting dtype_str to int8 also enables bitsandbytes quantization, so I thought we could reuse that, but it doesn't allow int4 out of the box because dtype_str is limited to torch dtypes. That can all be changed, but I would like to get a maintainer's opinion on the best way to do this first.
Another option could be to add an environment variable OPENVINO_WEIGHT_FORMAT and allow specifying an exact config for sym/asym, group size and ratio. That is the most flexible, but a different API than the other inference engines.
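To make the environment-variable idea concrete, a hypothetical sketch, assuming a later optimum-intel release that exposes OVWeightQuantizationConfig; the variable name, value syntax, and parsing are all illustrative:

```python
# Hypothetical mapping from an OPENVINO_WEIGHT_FORMAT environment variable to
# an optimum-intel weight-compression config. Not part of this PR.
# Example value: "int4_sym:group_size=128,ratio=0.8"
import os
from typing import Optional
from optimum.intel import OVWeightQuantizationConfig

def weight_config_from_env() -> Optional[OVWeightQuantizationConfig]:
    spec = os.environ.get("OPENVINO_WEIGHT_FORMAT")
    if not spec:
        return None  # fall back to optimum-intel defaults
    fmt, _, opts = spec.partition(":")
    extra = dict(kv.split("=") for kv in opts.split(",") if kv)
    return OVWeightQuantizationConfig(
        bits=4 if fmt.startswith("int4") else 8,
        sym=fmt.endswith("_sym"),
        group_size=int(extra.get("group_size", -1)),  # -1 = per-channel
        ratio=float(extra.get("ratio", 1.0)),
    )

# The resulting config could then be passed as quantization_config= when
# exporting the model with OVModelForCausalLM.from_pretrained(..., export=True).
```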
Signed-off-by: Helena <[email protected]>
Force-pushed from 76a44fa to 6349d91
Force-pushed from 6349d91 to 3fc754e
OpenVINO integration for text-generation-inference.
Known limitations:
It would be great to have a documented option to build the Docker image without GPU dependencies and flash-attention, maybe with a make cpubuild option, for example. make build-test-image works fine with this integration.